Case Study: Boston Real Estate¶


Overview¶

  • Programming language: Python.
  • Tools: NumPy, pandas, Matplotlib, Seaborn, and Plotly Express in Jupyter Notebook.
  • Skills: data cleaning, data analysis, and data visualization.
  • Objective: To provide insight into the Boston, Massachusetts real estate market and answer the following questions regarding the community.
    1. How would you visualize the median value of owner-occupied homes using two different methods of plotting?
    2. How would you visualize the number of houses bounded and not bounded by the Charles River with two different methods of plotting? Is there a significant difference in the median value of these houses?
    3. Is there a difference in the median values of houses for each proportion of owner-occupied units built before 1940?
    4. Is there a relationship between nitric oxide concentrations and the proportion of non-retail business acres per town?
    5. What is the impact of an additional weighted distance to the five Boston employment centres on the median value of owner-occupied homes?
  • Dataset: housing information and prices derived from the United States Census Service.
  • Dataset variables:
    • CRIM: per capita crime rate by town.
    • ZN: proportion of residential land zoned for lots over 25,000 square feet.
    • INDUS: proportion of non-retail business acres per town.
    • CHAS: whether the tract bounds the Charles River (1 if tract bounds river, 0 if the tract does not).
    • NOX: nitric oxides concentration (parts per 10 million).
    • RM: average number of rooms per dwelling.
    • AGE: proportion of owner-occupied units built prior to 1940.
    • DIS: weighted distances to five Boston employment centers.
    • RAD: index of accessibility to radial highways.
    • TAX: full-value property-tax rate per $10,000.
    • PTRATIO: pupil-teacher ratio by town.
    • LSTAT: percentage lower status of the population.
    • MEDV: Median value of owner-occupied homes in units of $1,000.
  • This project was initially completed as part of Coursera's IBM "Statistics for Data Science with Python" course. I independently rewrote, refined and extended it by adding additional code, analyses, and visualizations.

Table of Contents¶

  • Set Up
  • Data Exploration
  • Objective #1
    • Visualize the median value of owner-occupied homes using two different methods of plotting.
  • Objective #2
    • Visualize the number of houses bounded and not bounded by the Charles River with two different methods of plotting. Is there a significant difference in the median value of these houses?
  • Objective #3
    • Is there a difference in the median values of houses for each proportion of owner-occupied units built before 1940?
  • Objective #4
    • Is there a relationship between nitric oxide concentrations and the proportion of non-retail business acres per town?
  • Objective #5
    • What is the impact of an additional weighted distance to the five Boston employment centres on the median value of owner-occupied homes?
  • Summary

Set Up¶

Import Libraries¶

Load the necessary libraries and suppress warnings.

In [1]:
# Suppress warnings:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
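Reassigning `warnings.warn` to a no-op silences every warning for the whole session. A sketch of the standard-library alternative (not used in the original notebook) that does the same with `warnings.filterwarnings`, and can be scoped to a single block when needed:

```python
import warnings

# Standard approach: ignore all warnings for the rest of the session
warnings.filterwarnings("ignore")

# Or scope the suppression so it only applies inside one block
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)
    warnings.warn("hidden", FutureWarning)  # suppressed inside this block only
```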
In [2]:
# Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats
import statsmodels.api as sm
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import plotly.express as px
import plotly.graph_objects as go
import matplotlib as mpl
import matplotlib.pyplot as plt 
%matplotlib inline

Import Data¶

Load the real estate dataset and drop the first column.

In [3]:
# Load data
boston_url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ST0151EN-SkillsNetwork/labs/boston_housing.csv'
boston_df = pd.read_csv(boston_url)
boston_df.drop('Unnamed: 0', axis=1, inplace=True) #remove unused variable

Add a new variable to represent the full value of owner-occupied homes.

In [4]:
# Calculate variable for full value
boston_df["MEDV_units"] = boston_df["MEDV"]*1000

Data Exploration¶

Find the size, variable types, and descriptive statistics of the dataset.

In [5]:
# Find the size of dataset
print("There are currently", boston_df.shape[0], "rows and", boston_df.shape[1], "variables in this dataset.")
There are currently 506 rows and 14 variables in this dataset.
In [6]:
# Load the first five rows of the dataset
boston_df.head()
Out[6]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO LSTAT MEDV MEDV_units
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 4.98 24.0 24000.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 9.14 21.6 21600.0
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 4.03 34.7 34700.0
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 2.94 33.4 33400.0
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 5.33 36.2 36200.0
In [7]:
# Find the variable types
boston_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   CRIM        506 non-null    float64
 1   ZN          506 non-null    float64
 2   INDUS       506 non-null    float64
 3   CHAS        506 non-null    float64
 4   NOX         506 non-null    float64
 5   RM          506 non-null    float64
 6   AGE         506 non-null    float64
 7   DIS         506 non-null    float64
 8   RAD         506 non-null    float64
 9   TAX         506 non-null    float64
 10  PTRATIO     506 non-null    float64
 11  LSTAT       506 non-null    float64
 12  MEDV        506 non-null    float64
 13  MEDV_units  506 non-null    float64
dtypes: float64(14)
memory usage: 55.5 KB
In [8]:
# Descriptive statistics for all variables
boston_df.describe()
Out[8]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO LSTAT MEDV MEDV_units
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.613524 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 12.653063 22.532806 22532.806324
std 8.601545 23.322453 6.860353 0.253994 0.115878 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 7.141062 9.197104 9197.104087
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 1.730000 5.000000 5000.000000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 6.950000 17.025000 17025.000000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 11.360000 21.200000 21200.000000
75% 3.677083 12.500000 18.100000 0.000000 0.624000 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 16.955000 25.000000 25000.000000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 37.970000 50.000000 50000.000000

Objective #1: visualize the median value of owner-occupied homes using two different libraries for plotting.¶

Boxplots are created to assess the price distribution of owner-occupied homes, plotted with two different Python libraries: Seaborn and Plotly.

Method #1: a static visualization with Seaborn¶

In [9]:
# Graphing a boxplot with Seaborn
Q1_fig = sns.boxplot(data=boston_df, 
                 y="MEDV_units", 
                 linecolor="black", 
                 palette="Blues")
Q1_fig.set(ylabel = "Price", title ='Median Values of Owner-Occupied Homes')
Q1_fig.get_yaxis().set_major_formatter(mpl.ticker.StrMethodFormatter('{x:,.0f}')) # format y-axis price display

plt.show()
[Figure: boxplot of the median values of owner-occupied homes (Seaborn)]

Method #2: an interactive visualization with Plotly Express¶

In [10]:
# Graphing a boxplot with Plotly Express
Q1_fig2 = px.box(boston_df, 
             y="MEDV_units",
             width = 500,
             height = 600,
             title = "Median Values of Owner-Occupied Homes",
             labels={"MEDV_units": "Price"},
             template="simple_white")
Q1_fig2.update_layout(title_x=0.5)
Q1_fig2.update_layout(margin=dict(l=30, r=30, t=50, b=20))
Q1_fig2.show()

From the boxplots, it seems like there is a large spread of house values, with the median value of houses appearing to be slightly over $20,000. Descriptive statistics can be calculated to provide additional information on the distribution of house values.

In [11]:
# Print descriptive statistics of house values
print("\nBasic descriptive statistics can be seen here:")
funcs = {'average': np.mean, 
         'standard deviation': np.std,
         'median': np.median,
         'lowest': np.min,
         'highest': np.max}

for key, value in funcs.items():
    print("\tThe ", key," value is ${:,.2f}".format(boston_df['MEDV_units'].aggregate(value)), ".", sep="")

# print descriptive statistics of house values in dataframe
print("\nAdditional descriptive statistics:\n")
MEDV_table = pd.DataFrame(boston_df["MEDV_units"].describe().round(2))
MEDV_table.columns = ["Median Value of Homes"] # rename on the DataFrame (renaming the Styler has no effect)
MEDV_table.style.format('{:,.2f}')
Basic descriptive statistics can be seen here:
	The average value is $22,532.81.
	The standard deviation value is $9,197.10.
	The median value is $21,200.00.
	The lowest value is $5,000.00.
	The highest value is $50,000.00.

Additional descriptive statistics:

Out[11]:
  Median Value of Homes
count 506.00
mean 22,532.81
std 9,197.10
min 5,000.00
25% 17,025.00
50% 21,200.00
75% 25,000.00
max 50,000.00

From the boxplots and descriptive statistics, house values in this dataset range from $5,000.00 to $50,000.00, with a median of $21,200.00 and a mean of $22,532.81.


Objective #2: is there a significant difference in the median value of houses bounded by the Charles River?¶

Visualize real estate by the Charles River¶

Bar charts are first created to compare the number of houses bounding and not bounding the Charles River. We can rename the types of housing for ease of reading and calculate the total for each kind.

In [12]:
# Create a variable to display the binary values as strings
boston_df.loc[(boston_df['CHAS'] == 0), 'CHAS_STRING'] = 'House Does Not Bound River'
boston_df.loc[(boston_df['CHAS'] == 1), 'CHAS_STRING'] = 'House Bounds River'

# Create a dataframe for housing counts
CHAS_counts = boston_df.CHAS_STRING.value_counts().to_frame().reset_index()
print("\nNumber of houses that do not bound the Charles River:", CHAS_counts.iloc[0,1])
print("Number of houses that bound the Charles River:", CHAS_counts.iloc[1,1],"\n")
Number of houses that do not bound the Charles River: 471
Number of houses that bound the Charles River: 35 

We can then graph the findings with two methods, one static and one interactive. Houses that do not bound the river are shown in orange, while houses that bound it are shown in blue.

Method #1: a static visualization with Matplotlib¶

In [13]:
# Graphing a bar chart with Matplotlib
CHAS_bar = boston_df.CHAS_STRING.value_counts() # keep each label paired with its own count
plt.bar(CHAS_bar.index, 
        CHAS_bar.values,
        color=['orange','blue'])
plt.xlabel('\nType of House')
plt.ylabel('Count')
plt.title('Number of Houses by the Charles River')
plt.show()
[Figure: bar chart of the number of houses by the Charles River (Matplotlib)]

Method #2: an interactive visualization with Plotly Express¶

In [14]:
# Graphing a bar chart with Plotly Express
Q2_fig2 = px.bar(CHAS_counts, 
             x="CHAS_STRING", 
             y="count",
             width = 700,
             height = 600,
             color="CHAS_STRING",
             title="Number of Houses by the Charles River",
             labels={"CHAS_STRING":"Type of House",
                    "count":"Count"},
             color_discrete_sequence=["orange", "blue"],
             template="simple_white")
Q2_fig2.update_yaxes(range=[0, CHAS_counts.iloc[0,1].round(-2) ]) # extend the y-axis to the largest count, rounded to the nearest hundred
Q2_fig2.update_layout(title_x=0.5)
Q2_fig2.update_layout(margin=dict(l=30, r=30, t=50, b=20))
Q2_fig2.show()

From both these plots, we can see there are more houses that do not bound the Charles River than those that bound it.

A boxplot can be used to visualize any potential price differences between these different types of houses. As with the above bar chart, houses that are not located by the river are represented in orange while houses next to the river are shown in blue.

In [15]:
# Graphing a boxplot with Plotly Express
Q2_fig3 = px.box(boston_df, 
             x="CHAS_STRING",
             y="MEDV_units",
             width = 700,
             height = 600,
             title ='Housing Price Differences Based on Bounding the Charles River',
             labels={
                 "MEDV_units": "Price", 
                 "CHAS_STRING":"Type of House"},
             color="CHAS_STRING",
             color_discrete_sequence=["orange", "blue"],
             template="simple_white")
Q2_fig3.update_layout(title_x=0.5)
Q2_fig3.update_layout(margin=dict(l=30, r=30, t=50, b=20))
Q2_fig3.show()

The descriptive statistics for each group's housing values are as follows:

In [16]:
# Descriptive statistics
CHAS_STRING_table = boston_df.groupby("CHAS_STRING")["MEDV_units"].describe().round(2).style.format('{:,.2f}')
CHAS_STRING_table.index.name = "Type of House"
CHAS_STRING_table
Out[16]:
  count mean std min 25% 50% 75% max
Type of House                
House Bounds River 35.00 28,440.00 11,816.64 13,400.00 21,100.00 23,300.00 33,150.00 50,000.00
House Does Not Bound River 471.00 22,093.84 8,831.36 5,000.00 16,600.00 20,900.00 24,800.00 50,000.00

From the boxplot, the median value of houses that do not bound the Charles River appears lower than that of houses that bound it. The descriptive statistics also support this.

Statistical analysis¶

A t-test for independent samples is conducted to determine whether there is a significant difference in the median value of houses depending on their location to the Charles River.

  • Null hypothesis: there is no difference in the median value of houses bounded by the Charles River.
  • Alternative hypothesis: there is a difference in the median value of houses bounded by the Charles River.

Levene's test is first conducted to assess whether the variances between groups are equal.

In [17]:
# Levene's test to evaluate whether variances between groups are equal
Levene_test = scipy.stats.levene(boston_df[boston_df['CHAS'] == 0]['MEDV_units'],
                   boston_df[boston_df['CHAS'] == 1]['MEDV_units'], center='mean')

print("\nLevene's test results:")
print("\t", Levene_test, sep="")

print("\nLevene's test results in a p-value of ", Levene_test[1].round(3),
      ", which suggests the variances between groups are not equal. As such, the t-test for independent samples will be conducted with unequal variances.\n", sep="")
Levene's test results:
	LeveneResult(statistic=8.751904896045989, pvalue=0.003238119367639829)

Levene's test results in a p-value of 0.003, which suggests the variances between groups are not equal. As such, the t-test for independent samples will be conducted with unequal variances.

In [18]:
# T-test
ttest_var = scipy.stats.ttest_ind(boston_df[boston_df['CHAS'] == 0]['MEDV_units'],
                   boston_df[boston_df['CHAS'] == 1]['MEDV_units'], equal_var = False)
print("\nResults of a t-test with unequal variances:")
print("\t", ttest_var, "\n", sep="")

if ttest_var[1].round(3) < 0.05:
    print("The t-test results in a p-value of ", ttest_var[1].round(3), ". Since the p-value is less than α = 0.05, we reject the null hypothesis.", sep="")
else:
    print("The t-test results in a p-value of ", ttest_var[1].round(3), ". Since the p-value is greater than α = 0.05, we fail to reject the null hypothesis.", sep="")
Results of a t-test with unequal variances:
	TtestResult(statistic=-3.11329131279484, pvalue=0.0035671700981374896, df=36.87640879761199)

The t-test results in a p-value of 0.004. Since the p-value is less than α = 0.05, we reject the null hypothesis.
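As a sanity check, Welch's t statistic can be recomputed by hand from the group summaries in the descriptive statistics table above (means, standard deviations, and counts); it reproduces the `TtestResult` statistic, up to the rounding of the table values:

```python
import math

# Group summaries copied from the descriptive statistics table above
n0, mean0, sd0 = 471, 22093.84, 8831.36    # houses that do not bound the river
n1, mean1, sd1 = 35, 28440.00, 11816.64    # houses that bound the river

# Welch's t: mean difference over the unpooled standard error
se = math.sqrt(sd0**2 / n0 + sd1**2 / n1)
t = (mean0 - mean1) / se
print(round(t, 3))  # -3.113, matching the TtestResult statistic above
```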

Conclusion¶

This test suggests there is a significant difference in the median value of houses bounded and not bounded by the Charles River.


Objective #3: is there a significant difference in median values of houses for each proportion of owner-occupied units built before 1940?¶

The following graph and analyses compare the median price of owner-occupied homes by the age of the housing stock, which is discretized into three groups: houses 35 years and younger, houses between 35 and 70 years, and houses 70 years and older.

The age of the units built before 1940 will first be partitioned into three groups for analysis.

In [19]:
# Discretize age groups
boston_df.loc[(boston_df['AGE'] <= 35), 'age_group'] = '35 Years and Younger'
boston_df.loc[(boston_df['AGE'] > 35)&(boston_df['AGE'] < 70), 'age_group'] = 'Between 35 and 70 Years'
boston_df.loc[(boston_df['AGE'] >= 70), 'age_group'] = '70 Years and Older'
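The same three-way split can be written as a single vectorized assignment with `np.select`; conditions are evaluated in order, so the boundary values 35 and 70 land in the same groups as the `.loc` assignments above. A small self-contained sketch on toy values:

```python
import numpy as np
import pandas as pd

# Toy AGE values, including the boundary cases 35 and 70
df = pd.DataFrame({"AGE": [2.9, 35.0, 50.0, 70.0, 100.0]})

# Conditions are checked in order: 35 falls in the youngest group,
# 70 in the oldest, matching the .loc assignments above
conditions = [df["AGE"] <= 35, df["AGE"] >= 70]
choices = ["35 Years and Younger", "70 Years and Older"]
df["age_group"] = np.select(conditions, choices, default="Between 35 and 70 Years")
print(df["age_group"].tolist())
```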

Visualize the median values of homes built before 1940 by their age¶

A boxplot is used to compare the price distribution of owner-occupied homes by their age.

In [20]:
# Plotting with Seaborn
sns.reset_orig()
plt.figure(figsize=(13,10)) # size the figure before plotting
Q3_fig1 = sns.boxplot(data=boston_df, x="age_group", y="MEDV_units", palette="Blues")
Q3_fig1.set(xlabel="\nProportion of Owner-Occupied Units by Home Age", ylabel = "Value", title ='Median Values of Owner-Occupied Homes By Age')
Q3_fig1.get_yaxis().set_major_formatter(mpl.ticker.StrMethodFormatter('{x:,.0f}'))
plt.show()

print("\nDescriptive statistics of each group's housing price are as follows:\n")
boston_df.groupby("age_group")["MEDV_units"].describe().round(2)
[Figure: boxplot of median home values by age group (Seaborn)]
Descriptive statistics of each group's housing price are as follows:

Out[20]:
count mean std min 25% 50% 75% max
age_group
35 Years and Younger 91.0 27775.82 7638.20 17100.0 23050.0 24800.0 31150.0 50000.0
70 Years and Older 287.0 19793.38 9515.38 5000.0 13800.0 18200.0 22550.0 50000.0
Between 35 and 70 Years 128.0 24947.66 6969.37 10200.0 20675.0 22600.0 27425.0 50000.0

From the boxplot, houses that are 70 years and older appear to have relatively lower values, while houses that are 35 years and younger appear to have relatively higher values. The descriptive statistics for each age group support this perceived price difference.

Statistical analysis¶

An ANOVA is conducted to determine whether there is a significant difference in the median price of houses for each proportion of owner-occupied units built before 1940.

  • Null hypothesis: there is no difference in the median value of houses for each proportion of owner-occupied units built prior to 1940.
  • Alternative hypothesis: there is a difference in the median value of houses for each proportion of owner-occupied units built prior to 1940.
In [21]:
# ANOVA
ANOVA_group1 = boston_df[boston_df['age_group'] == '35 Years and Younger']['MEDV_units']
ANOVA_group2 = boston_df[boston_df['age_group'] == 'Between 35 and 70 Years']['MEDV_units']
ANOVA_group3 = boston_df[boston_df['age_group'] == '70 Years and Older']['MEDV_units']

ANOVA_var = scipy.stats.f_oneway(ANOVA_group1, ANOVA_group2, ANOVA_group3)

print("\nANOVA results:")
print("\t",ANOVA_var,sep="")

print("\nAn ANOVA results in a p-value of {:0.3e}".format(ANOVA_var[1]), ". Since this p-value is less than α = 0.05, we reject the null hypothesis.\n", sep="")
ANOVA results:
	F_onewayResult(statistic=36.40764999196602, pvalue=1.7105011022701769e-15)

An ANOVA results in a p-value of 1.711e-15. Since this p-value is less than α = 0.05, we reject the null hypothesis.

Since the ANOVA above is significant, a post hoc analysis using Tukey's HSD test, which corrects for multiple comparisons, is conducted to evaluate which group means differ significantly.

In [22]:
# Run post hoc analysis to determine significantly different pairs 
print(pairwise_tukeyhsd(boston_df['MEDV_units'], boston_df['age_group']))
                    Multiple Comparison of Means - Tukey HSD, FWER=0.05                     
============================================================================================
       group1                 group2          meandiff  p-adj     lower      upper    reject
--------------------------------------------------------------------------------------------
35 Years and Younger      70 Years and Older -7982.4444    0.0 -10418.1857 -5546.7031   True
35 Years and Younger Between 35 and 70 Years -2828.1679 0.0447  -5604.3202   -52.0157   True
  70 Years and Older Between 35 and 70 Years  5154.2765    0.0   3002.3619   7306.191   True
--------------------------------------------------------------------------------------------

The results from the post hoc Tukey's HSD test suggest that the prices of all three groups are significantly different from one another. The average median price of houses 35 years and younger is the highest of the three groups, followed by houses between 35 and 70 years. Houses that are 70 years and older had the lowest average median price of the three groups.

Conclusion¶

An ANOVA test suggests there is a significant difference in the median value of houses for each proportion of owner-occupied homes built prior to 1940. The average median price of houses 35 years and younger is the highest of the three groups while the average median price of houses 70 years and older is the lowest of the three groups. That is, houses differ in their value based on age, with younger houses tending to be more expensive than older houses.


Objective #4: is there a relationship between nitric oxide concentrations and the proportion of non-retail business acres per town?¶

Assessing the dataset shows that many rows share identical values for the "INDUS" and "NOX" variables. Both are measured at the town level, so every tract within the same town repeats the same pair of values; correlating these repeated values would overweight larger towns and overstate the effective sample size. To resolve this, rows with duplicate "INDUS" values are discarded so that only unique town-level observations remain.

In [23]:
# Retain only unique values of "INDUS"
boston_df_no_INDUS_duplicates = boston_df.drop_duplicates(subset=["INDUS"])
In [24]:
# Find size of dataset
print('The new dataset includes', boston_df_no_INDUS_duplicates.shape[0], 'rows and', boston_df_no_INDUS_duplicates.shape[1], 'columns.')
The new dataset includes 76 rows and 16 columns.

Visualize the relationship between nitric oxide concentrations and the proportion of non-retail business acres per town.¶

A scatterplot is created to assess the relationship between the two variables.

In [25]:
# Plotting with Seaborn
sns.reset_orig()
Q4_fig = sns.scatterplot(x="INDUS", y="NOX", data=boston_df_no_INDUS_duplicates)
Q4_fig.set(xlabel="Proportion of Non-Retail Business Acres per Town", 
       ylabel = "Nitric Oxide Concentration\n(Parts per 10 Million)", 
       title ='Relationship Between Non-Retail Business Acres per Town and Nitric Oxide Concentrations')
plt.show()
[Figure: scatterplot of nitric oxide concentration against non-retail business acres (Seaborn)]

There seems to be a positive, linear relationship between the proportion of non-retail business acres per town and nitric oxide concentrations: greater proportions of non-retail business acres appear to be associated with higher nitric oxide concentrations.

Statistical analysis¶

A Pearson Correlation is conducted to determine whether there is a significant relationship between nitric oxide concentrations and the proportion of non-retail business acres per town.

  • Null hypothesis: there is no relationship between nitric oxide concentrations and the proportion of non-retail business acres per town.
  • Alternative hypothesis: there is a relationship between nitric oxide concentrations and the proportion of non-retail business acres per town.
In [26]:
# Pearson Correlation
pearsonr_var = scipy.stats.pearsonr(boston_df_no_INDUS_duplicates['INDUS'], boston_df_no_INDUS_duplicates['NOX'])

print("\nPearson Correlation results:")
print("\t",pearsonr_var,sep="")

print("\nA Pearson Correlation results in a p-value of {:0.3e}".format(pearsonr_var[1]), ".", sep="")
print("Since this p-value is less than α = 0.05, we reject the null hypothesis.\n")
Pearson Correlation results:
	PearsonRResult(statistic=0.6809525800688366, pvalue=1.3018316252583866e-11)

A Pearson Correlation results in a p-value of 1.302e-11.
Since this p-value is less than α = 0.05, we reject the null hypothesis.

Conclusion¶

This test suggests there is a significant relationship between nitric oxide concentrations and the proportion of non-retail business acres per town. The two variables are positively correlated (r = 0.68, so they share roughly 46% of their variance), with increases in one associated with increases in the other.


Objective #5: what is the impact of an additional weighted distance to the five Boston employment centers on the median value of owner-occupied homes?¶

Statistical analysis¶

A linear regression is conducted to determine whether there is a significant relationship between weighted distance to the Boston employment centers and median price of owner-occupied homes.

  • Null hypothesis: there is no relationship between the weighted distance to the five Boston employment centers and the median value of owner-occupied homes.
  • Alternative hypothesis: there is a relationship between the weighted distance to the five Boston employment centers and the median value of owner-occupied homes.
In [27]:
X = boston_df["DIS"]
y = boston_df["MEDV_units"]
X = sm.add_constant(X) 

model = sm.OLS(y, X).fit()
predictions = model.predict(X)

print("\nRegression results:")
print(model.summary())

if model.pvalues[1].round(3) < 0.05:
    print("\nThe regression results in a p-value of {:0.3e}".format(model.pvalues[1]), 
          " for the independent variable. Since the p-value is less than α = 0.05, we reject the null hypothesis.", sep="")
else:
    print("\nThe regression results in a p-value of {:0.3e}".format(model.pvalues[1]), 
          " for the independent variable. Since the p-value is greater than α = 0.05, we fail to reject the null hypothesis.", sep="")

print("\nFor every additional 1 unit of weighted distance to the five Boston employment centers, the median house price increases on average by ${:0.2f}".format(model.params[1]),
      ".\n",sep="")
Regression results:
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             MEDV_units   R-squared:                       0.062
Model:                            OLS   Adj. R-squared:                  0.061
Method:                 Least Squares   F-statistic:                     33.58
Date:                Wed, 05 Mar 2025   Prob (F-statistic):           1.21e-08
Time:                        06:18:16   Log-Likelihood:                -5319.2
No. Observations:                 506   AIC:                         1.064e+04
Df Residuals:                     504   BIC:                         1.065e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       1.839e+04    817.389     22.499      0.000    1.68e+04       2e+04
DIS         1091.6130    188.378      5.795      0.000     721.509    1461.717
==============================================================================
Omnibus:                      139.779   Durbin-Watson:                   0.570
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              305.104
Skew:                           1.466   Prob(JB):                     5.59e-67
Kurtosis:                       5.424   Cond. No.                         9.32
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The regression results in a p-value of 1.207e-08 for the independent variable. Since the p-value is less than α = 0.05, we reject the null hypothesis.

For every additional 1 unit of weighted distance to the five Boston employment centers, the median house price increases on average by $1091.61.

Conclusion¶

This test suggests there is a significant relationship between the weighted distance to the five Boston employment centers and the median value of owner-occupied homes: greater distances were associated with higher housing prices. Note, however, that the relationship is weak (R² = 0.062), so distance alone explains only about 6% of the variance in home values.
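The fitted model in the regression summary is predicted MEDV_units = 18,390 + 1,091.61 × DIS, so the "impact of an additional weighted distance" is simply the slope. A minimal sketch, using the coefficients read off the OLS table above:

```python
# Intercept and slope from the OLS coefficient table above
intercept, slope = 1.839e4, 1091.6130

def predicted_price(dis):
    """Predicted median home value (in dollars) at a given weighted distance."""
    return intercept + slope * dis

# One extra unit of weighted distance raises the prediction by the slope
print(predicted_price(4) - predicted_price(3))  # ≈ 1091.61
```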


Summary¶

  • The house values in this dataset had a large range ($5,000.00 to $50,000.00), with an average valuation of $22,532.81.
  • However, the location of the house impacted its value. There were fewer houses that bounded the Charles River, but they were significantly more expensive than houses located farther away from the river.
  • The age of a house also significantly impacted its price as older houses tended to be less valuable.
  • Air quality was worse in areas with more non-retail business activity: there was a significant positive correlation between nitric oxide concentrations and the proportion of non-retail business acres per town, such that more business acres were associated with greater nitric oxide concentrations.
  • The distance from housing to employment areas also factored into price. There was a significant positive relationship between distance to the Boston employment centers and housing values, with greater distances related to higher house prices.

Notebook sources:

  • IBM Corporation
  • Coursera